Abstract

This report covers research activity under the grant in place from April 1993
through March 1994. Our research addressed algorithms and theory for stochastic
learning, non-linear extensions of principal component analysis (PCA) for dimension
reduction, network pruning, and methods to incorporate desired invariances into
learning.

In stochastic learning we extended and refined theoretical analysis developed prior
to the grant period [1, 2, for example], focusing on asymptotic behavior in order to
develop algorithms that improve late-time convergence rates. The resulting algorithms
modify stochastic gradient methods by implicitly incorporating information about cost
function curvature, without computing large Hessian matrices [3, 4].

Our algorithms for dimension reduction extend traditional PCA by developing non-linear
data models. Our algorithms build locally linear models. The advantages over
PCA are more accurate and compact data representations. The advantage over neural
net approaches is that our techniques can be several orders of magnitude quicker to
train. Under the present grant we improved the algorithms' accuracy, exercised them
on a broader range of data (including speech and image data), and began to relate our
algorithms to Gaussian-mixture models and use them for classification [5, 6].

Our work on network pruning introduced a fast algorithm for removing excess degrees
of freedom. As with all regularization techniques, the aim is to reduce model variance
at the cost of bias in order to improve performance on out-of-sample data. Network
pruning techniques typically use the Hessian to quantify the bias introduced by
removing specific network weights. For large networks this is intractable, and
approximations (often poor) to the Hessian are adopted. Instead, we examine the
correlation between node activities using PCA, and use it to find the least bias
introduced by removing degrees of freedom. This technique avoids large matrix
calculations and is fast and effective [7].

Our most recent work establishes a correspondence between two methods for
incorporating invariances into pattern recognition. Ideally pattern recognition machines
provide constant output when the inputs are transformed under a group of desired
invariances (e.g. translational and rotational invariance in machine vision). Two
methods to achieve this (there are others) are i) enhancing the training data to
include examples of inputs transformed by elements of the group, while leaving the
corresponding targets unchanged, and ii) adding to the cost function a regularization
term that penalizes changes in the output when the input is transformed under the
group. Our work relates the two approaches, showing precisely the sense in which the
regularized cost function approximates the result of adding transformed examples to
the training data [8].

1 Objectives

Below  we  list  the  objectives  for  the  research  areas  summarized  in  the  abstract: 

1.1  Stochastic  Learning 

• Refine the theoretical treatment to avoid the diffusion approximation of the
earlier work. This has been accomplished by developing a perturbation expansion
for the appropriate master equation and relating this to van Kampen's system
size expansion [9].

• Extend the analysis to algorithms with learning rate annealing. To achieve
convergence (in mean square or with probability one), stochastic gradient algorithms
typically employ learning rate, or gain, schedules that behave asymptotically as
$\mu_0/t$, with $t$ the iteration number and $\mu_0$ the initial learning rate. For such
schedules the rate at which the logarithm of the squared weight error decays appears
to be bounded below by $1/t$. This optimal rate is achieved if $\mu_0 > \mu_{crit}$, where
$\mu_{crit}$ is determined by the Hessian of the cost function at the minimum.

We extended our analysis to such learning rate schedules and reproduced known
results on asymptotic convergence rates and distributions.

• Extend the treatment of asymptotics to stochastic gradient descent with
momentum. The motivation for this was to develop algorithms that ensure the optimal
convergence rate without knowledge of the Hessian. Intuitively, since "momentum"
terms track previous updates, second differences of the cost function are
implicitly contained in such algorithms. Using the analysis as a spring-board,
we developed algorithms that incorporate an adaptive momentum coefficient and
succeed in obtaining optimal convergence rates.

1.2  Non-Linear  Data  Modeling 

•  Further  develop  our  locally-linear  dimension  reduction  algorithms  to  enhance  the 
accuracy  of  the  representations  obtained. 

• Exercise the algorithms on a broader range of data (both speech and image data)
for comparison with PCA and neural-net-based approaches.

The  results  of  these  studies  were  published  in  [5]. 

• Relate the locally linear models to parametric distributions, in particular
Gaussian mixture models, and extend their use from dimension reduction to
classification. The results of this work will appear in [6].

1.3  Network  Pruning 

• Make use of the correlation in node activities to prune effective degrees of freedom
from  networks.  This  technique  was  motivated  by  the  desire  to  prune  models 
quickly,  and  without  computation  of  large  Hessian  matrices. 

This work was published in [7] and was one of fewer than 6% of the submitted
papers chosen for oral presentation at the 1993 Neural Information Processing
Systems conference. The technique has been incorporated into algorithms for
time-series prediction in the financial marketplace.

1.4 Invariances in Learning

• Develop the correspondence between data set enhancement and regularization
[10, for example], or "hints" [11, for example], as means to provide invariance in
learning machines.

This work has been accepted for publication in Neural Computation, and will be
presented at the 1994 NIPS conference [8].

2 Quantitative Performance Measures

1.  Publications  in  Journals  -  2 

(a) Leen, T.K., A Coordinate-Independent Center Manifold Reduction, Physics
Letters A, 174, 89-93, 1993.

(b) Leen, T.K., Distortions and Regularization in Pattern Recognition, Neural
Computation, to appear.

2.  Publications  in  Conference  Proceedings  -  6 

(a) Orr, G.B. and Leen, T.K.: Momentum and Optimal Stochastic Search, in
Proceedings of the 1993 Connectionist Models Summer School, M.C. Mozer,
P. Smolensky, D.S. Touretzky, J.L. Elman and A.S. Weigend (eds.), Erlbaum
Associates, 1993.

(b) Leen, T.K. and Orr, G.B.: Momentum and Optimal Stochastic Search, in
J.D. Cowan, G. Tesauro and J. Alspector (eds.), Advances in Neural
Information Processing Systems, 6, Morgan Kaufmann Publishers, 1994.

(c) Leen, T.K. and Kambhatla, N.: Fast Non-Linear Dimension Reduction, in
J.D. Cowan, G. Tesauro and J. Alspector (eds.), Advances in Neural
Information Processing Systems, 6, Morgan Kaufmann Publishers, 1994.

(d) Levin, A.U. and Leen, T.K.: Fast Pruning Using Principal Components,
in J.D. Cowan, G. Tesauro and J. Alspector (eds.), Advances in Neural
Information Processing Systems, 6, Morgan Kaufmann Publishers, 1994.

This paper was one of 6% of the submissions chosen for oral presentation
at the conference.

(e) Kambhatla, N. and Leen, T.K.: Classifying with Gaussian Mixtures,
Clusters, and Subspaces, Advances in Neural Information Processing Systems, 7,
to appear.

(f) Leen, T.K.: From Data Distributions to Regularization in Invariant
Learning, Advances in Neural Information Processing Systems, 7, to appear.

3.  Journal  article  in  preparation: 

(a) Leen, T.K., Orr, G.B. and Moody, J.E.: Ensemble Theory of Stochastic
Learning.

4. Books or book chapters - 0

5. Graduate students supported - Nanda Kambhatla and Genevieve Orr. Both are
expected to complete the Ph.D. by June, 1995. (Partial support for the PI and
students is from a grant from the Electric Power Research Institute. The AFOSR
grant supported one graduate student full time.)

6.  Postdoctoral  associates  supported  -  0 

7.  External  honors 

(a) The PI served as theory program co-chair for the 1993 Neural Information
Processing Systems conference.

(b) The PI is currently serving as Workshops Chair for the 1994 Neural
Information Processing Systems conference.

(c) The PI was promoted to Associate Professor in July 1993.

(d) The PI was recently invited to join the editorial board of Neural
Computation.

3 Accomplishments

We are engaged in research in two primary areas: stochastic learning, and non-linear
data modeling and dimension reduction. Additional work on model pruning and
learning invariances was supported under the grant and is discussed in this report. Our
work is reported in the NIPS 1993 conference proceedings [4, 5, 7] and in the upcoming
NIPS 1994 conference [6, 8]. Some of the work on stochastic learning also appears in
the Proceedings of the 1993 Connectionist Models Summer School [3]. The work on
network pruning was also presented at the NATO Workshop on Statistics and Neural
Networks [12]. In addition to my primary work discussed here, during the term of the
grant I published an article in Physics Letters outlining a new technique for computing
bifurcations of dynamical systems [13].


3.1  Stochastic  Learning 


Since the original submission we have focused our work on stochastic learning in two
areas. The first is directed at overcoming the limitations of the diffusion approximation.
Towards this end we developed a perturbation expansion for solutions of the
Kramers-Moyal equation. Our perturbation expansion provides the probability density as a
power series in the learning rate $\mu$. Though independently developed, the perturbation
expansion is intimately linked to van Kampen's system size expansion [9].

The second area focuses on asymptotic (late-time) behavior of algorithms with
learning rate schedules $\mu = \mu_0/t$. We are developing algorithms that use an adaptive
momentum parameter to help achieve nearly optimal convergence rates. Our algorithm
improves on previous methods as it does not require estimates of the eigenvalue
spectrum of the Hessian, as does Fabian's [14] approach; neither does the algorithm rely
on measuring an auxiliary statistic, as does the approach developed by Darken and
Moody [15].


Perturbation  Expansion 

In [2] we gave an equilibrium solution for the LMS algorithm obtained in the diffusion
approximation, and showed its validity for small learning rates. In order to obtain
more accurate equilibrium solutions, we have developed a perturbative expansion of
solutions to the Kramers-Moyal equation¹. The latter contains all the dynamics of the
probability density (we refer the reader to the original proposal).

Perturbation techniques, familiar from classical and quantum physics, enable one to
construct approximate solutions to intractable problems that are similar to problems
that can be solved in closed form. Here we will develop a perturbation expansion
for solutions to the forward Kramers-Moyal equation. We limit our discussion to the
equilibrium density for the LMS algorithm, though the technique extends to other
problems, and to transient phenomena as well.

Stationary densities for the Kramers-Moyal equations are solutions to

$$ \sum_{i=1}^{\infty} \frac{(-\mu)^i}{i!} \sum_{j_1,\ldots,j_i=1}^{N} \frac{\partial^i}{\partial w_{j_1} \cdots \partial w_{j_i}} \left\{ \langle H_{j_1} H_{j_2} \cdots H_{j_i} \rangle_x \, P_s(w) \right\} \equiv L_{KM} P_s(w) = 0 \qquad (1) $$

This is not solvable in closed form. The idea behind the perturbation technique
is to write both the operator $L_{KM}$ and the stationary solution $P_s$ as a
power series in a small parameter, and then to solve the resulting expression
order-by-order in this parameter. Here, the naturally occurring perturbation
parameter is the learning rate $\mu$.

¹ This work is included in a long journal manuscript under preparation.

Schematically, the method proceeds as follows. We rewrite the operator $L_{KM}$
as

$$ L_{KM} = \mu \left( L_0 + \mu L_1 + \mu^2 L_2 + \cdots \right) , \qquad (2) $$

where the $L_i$ will be made explicit below. Next we expand $P_s$ as

$$ P_s = P^{(0)} + \mu P^{(1)} + \mu^2 P^{(2)} + \cdots \qquad (3) $$

where the $P^{(i)}$ are to be determined. Now, we substitute (2) and (3) into (1) to
obtain

$$ \mu L_0 P^{(0)} + \mu^2 \left( L_1 P^{(0)} + L_0 P^{(1)} \right) + \mu^3 \left( L_2 P^{(0)} + L_1 P^{(1)} + L_0 P^{(2)} \right) + \cdots = 0 . \qquad (4) $$

In order for (4) to hold for arbitrary $\mu$, the coefficients of each power of $\mu$ must
vanish:

$$ L_0 P^{(0)} = 0 $$
$$ L_0 P^{(1)} = -L_1 P^{(0)} $$
$$ L_0 P^{(2)} = -\left( L_2 P^{(0)} + L_1 P^{(1)} \right) \quad \text{etc.} \qquad (5) $$

The strategy is to successively solve each equation in (5). The key is to obtain
a representation for $L_{KM}$ such that $L_0$ is an operator with a known complete set
of eigenfunctions. Each of the $P^{(i)}$ is then expanded in terms of the eigenfunctions
of $L_0$. $P^{(0)}$ is simply the kernel of $L_0$.

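Let us assume that we have such a representation for $L_0$ with eigenvalues $-\lambda$
and eigenfunctions $F_\lambda$

$$ L_0 F_\lambda = -\lambda F_\lambda . \qquad (6) $$

The adjoint of $L_0$, denoted $L_0^\dagger$, has eigenvalues $-\lambda$ and eigenfunctions $G_\lambda$

$$ L_0^\dagger G_\lambda = -\lambda G_\lambda . \qquad (7) $$

It is easy to show that the two sets of eigenfunctions can be chosen to be
bi-orthogonal

$$ \int dw \, G_{\lambda'}(w) F_\lambda(w) \equiv (G_{\lambda'}, F_\lambda) = \delta_{\lambda \lambda'} . \qquad (8) $$

Finally, given these basis eigenfunctions, the $P^{(i)}$ that satisfy (5) can be
expanded as

$$ P^{(1)} = \sum_{\lambda \neq 0} \frac{1}{\lambda} \left( G_\lambda, L_1 P^{(0)} \right) F_\lambda \qquad (10) $$

$$ P^{(2)} = \sum_{\lambda \neq 0} \frac{1}{\lambda} \left( G_\lambda, L_2 P^{(0)} + L_1 P^{(1)} \right) F_\lambda \quad \text{etc.} \qquad (11) $$

having used eigenequations (6) and the bi-orthogonality relation (8).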
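The step from the order-by-order equations to the expansion coefficients can be spelled out as follows. This is a sketch only: it assumes the eigenfunctions $F_\lambda$ are complete, that $\lambda = 0$ is non-degenerate, and that the normalization is carried entirely by $P^{(0)}$; the coefficient symbol $c_\lambda$ is introduced here purely for illustration.

```latex
% Expand the first-order correction in eigenfunctions of L_0 (no lambda = 0 component)
P^{(1)} = \sum_{\lambda \neq 0} c_\lambda F_\lambda
\quad\Longrightarrow\quad
L_0 P^{(1)} = -\sum_{\lambda \neq 0} \lambda \, c_\lambda F_\lambda = -L_1 P^{(0)} .

% Project onto G_{\lambda'} and use bi-orthogonality (G_{\lambda'}, F_\lambda) = \delta_{\lambda\lambda'}
\lambda' \, c_{\lambda'} = \left( G_{\lambda'}, L_1 P^{(0)} \right)
\quad\Longrightarrow\quad
c_\lambda = \frac{1}{\lambda} \left( G_\lambda, L_1 P^{(0)} \right), \qquad \lambda \neq 0 .
```

The same projection applied to the second-order equation replaces $L_1 P^{(0)}$ by $L_2 P^{(0)} + L_1 P^{(1)}$.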

We have applied the technique outlined above to compute equilibrium densities
for the LMS algorithm with targets generated by a noisy teacher neuron.
We denote the displacement from the optimal weight $w_*$ by $v = w - w_*$. It is
convenient to make the change of variables

$$ v = y \sqrt{\mu \sigma^2} \qquad (12) $$

where $\mu$ is the learning rate and $\sigma^2$ is the variance of the teacher noise.

In the $y$ coordinates, the first two operators in (2) are

$$ L_0 = R \left( \partial_y \, y + \tfrac{1}{2} \partial_y^2 \right) $$
$$ L_1 = [\text{expression illegible in the source}] $$

where $R$ is the input correlation. Note that $L_0$ corresponds to an
Ornstein-Uhlenbeck process. Its eigenfunctions are Gaussians multiplied by Hermite
polynomials. The first two are

$$ P^{(0)} = \frac{1}{\sqrt{\pi}} \, e^{-y^2} $$
$$ P^{(1)} = [\text{prefactor illegible in the source}] \; P^{(0)} . $$

A graphical comparison shows that the perturbative solutions are superior to
those obtained from the Fokker-Planck equation. In Figure 1 we compare the
Fokker-Planck, perturbative and experimentally derived densities for 1-D LMS.


Figure 1: Simulated (dashed), Fokker-Planck (dotted), and perturbative (solid)
densities for 1-D LMS with Gaussian inputs ($\mu = 0.05$, $\sigma^2 = 1$, and $R = 4$). a)
0th order: $P^{(0)}$; b) 1st order: $P^{(0)} + \mu P^{(1)}$.


The  perturbation  solutions  have  been  carried  to  higher  order  and  applied  to 
multidimensional  problems  as  well. 
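
The size of the finite-learning-rate correction can be checked numerically for the Figure 1 setting. The sketch below is illustrative only: it simulates an ensemble of 1-D LMS learners with Gaussian inputs and Gaussian teacher noise, and compares the second moment of the rescaled weight error $y = v/\sqrt{\mu\sigma^2}$ with the value 1/2 implied by a zeroth-order Gaussian density $e^{-y^2}/\sqrt{\pi}$. The closed-form value $1/(2 - 3\mu R)$ is derived here from the stationary second-moment equation of the LMS update under the Gaussian-input assumption; it is not quoted from the report, and the function and variable names are hypothetical.

```python
import numpy as np

def simulate_lms_equilibrium(mu=0.05, sigma2=1.0, R=4.0,
                             n_nets=50_000, n_steps=300, seed=0):
    """Run an ensemble of 1-D LMS learners; return samples of y = v / sqrt(mu * sigma2)."""
    rng = np.random.default_rng(seed)
    v = np.zeros(n_nets)                                 # weight error v = w - w_*
    for _ in range(n_steps):
        x = rng.normal(0.0, np.sqrt(R), n_nets)          # Gaussian input, E[x^2] = R
        eps = rng.normal(0.0, np.sqrt(sigma2), n_nets)   # teacher noise
        v = v - mu * x * (x * v - eps)                   # LMS update written for v
    return v / np.sqrt(mu * sigma2)

mu, sigma2, R = 0.05, 1.0, 4.0
y = simulate_lms_equilibrium(mu, sigma2, R)

second_moment_sim = float(np.mean(y ** 2))
second_moment_zeroth = 0.5                               # variance of exp(-y^2) / sqrt(pi)
second_moment_exact = 1.0 / (2.0 - 3.0 * mu * R)         # assumes Gaussian inputs

print(second_moment_sim, second_moment_zeroth, second_moment_exact)
```

With $\mu R = 0.2$ the stationary second moment is noticeably larger than the zeroth-order value, which is the kind of discrepancy the higher-order terms of the expansion are meant to capture.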

Our perturbation technique is intimately linked to Van Kampen's system
size expansion [9] for the Kramers-Moyal equation. To apply Van Kampen's
expansion to neural net learning, one writes the weight error $v$ as the sum of
deterministic plus fluctuation pieces

$$ v = \phi + \sqrt{\mu} \, y . \qquad (13) $$

The deterministic piece $\phi$ evolves by descent along the average or true gradient,
approaching zero exponentially. At late times, only the dynamics of the
fluctuations $y$ remain. These are described to lowest order by $L_0$ (and this is what
Van Kampen treats). Our perturbation expansion extends the description to
arbitrary order.


Asymptotics  and  Optimal  Convergence 

Learning rate schedules of the form $\mu = \mu_0/t$ give rise to weight vector sequences
that converge in mean square to local optima $w_*$. The asymptotic rate of
convergence is conveniently characterized by the expected squared weight error (or
misadjustment) $E[|v|^2] = E[|w - w_*|^2]$. It is well known [15, 16, and references
therein] that the convergence rate is governed by a critical value of the initial
learning rate

$$ \mu_{crit} = \frac{1}{2 \lambda_{min}} \qquad (14) $$

where $\lambda_{min}$ is the smallest eigenvalue of the cost function's Hessian evaluated at
$w_*$. If $\mu_0 > \mu_{crit}$ then the misadjustment falls off asymptotically as $1/t$, whereas if
$\mu_0 < \mu_{crit}$ the misadjustment falls off slower than $1/t$.
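
The two regimes can be illustrated without any random sampling by iterating the second-moment recursion for 1-D LMS. The sketch below is an illustration under stated assumptions: Gaussian inputs with $E[x^2] = R$ (so $E[x^4] = 3R^2$), teacher noise variance `sigma2`, and a schedule $\mu_t = \mu_0/t$ started at a hypothetical iteration `t0`. With $R = 1$ the critical value is 0.5, so $\mu_0 = 1.0$ should give a log-log slope near -1 and $\mu_0 = 0.25$ a slope near $-2\mu_0 R = -0.5$.

```python
import numpy as np

def misadjustment_curve(mu0, R=1.0, sigma2=1.0, m0=1.0, t0=10, t_max=100_000):
    """Iterate E[v^2] for 1-D LMS with learning rate mu0 / t (Gaussian inputs assumed)."""
    ts = np.arange(t0, t_max + 1)
    m = np.empty(len(ts))
    m[0] = m0
    for i in range(1, len(ts)):
        mu = mu0 / ts[i - 1]
        # E[(1 - mu x^2)^2] = 1 - 2 mu R + 3 mu^2 R^2 for Gaussian x
        m[i] = (1.0 - 2.0 * mu * R + 3.0 * (mu * R) ** 2) * m[i - 1] + mu ** 2 * R * sigma2
    return ts, m

def late_slope(ts, m):
    """Least-squares slope of log m versus log t over the last decade of iterations."""
    sel = ts >= ts[-1] / 10
    return float(np.polyfit(np.log(ts[sel]), np.log(m[sel]), 1)[0])

slope_above = late_slope(*misadjustment_curve(mu0=1.0))    # mu0 above the critical value
slope_below = late_slope(*misadjustment_curve(mu0=0.25))   # mu0 below the critical value
print(slope_above, slope_below)
```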

To achieve the optimal rate, one must estimate $\mu_{crit}$ and adjust $\mu_0$ to be larger
($\mu_0 = 1/\lambda_{min}$ is optimal). Fabian [14] estimates the Hessian during the
optimization, and uses the estimate to readjust $\mu_0$. This is clearly not feasible
for high-dimensional optimization problems. Darken and Moody [15] measure a
statistic that characterizes the roughness of trajectories, and use the time
evolution of this statistic to adjust $\mu_0$. This approach, though less storage intensive
than estimating the Hessian's eigenvalue spectrum, requires computation not
central to the search process.

We are proposing an alternative solution based on stochastic gradient descent
with momentum. Using the Kramers-Moyal expansion we have rederived the
classic results on convergence rates (and related results on asymptotic normality)
and extended them to stochastic gradient descent with momentum [4, 3]. The
analysis shows that at late times, learning is governed by an effective learning
rate

$$ \mu_{eff} = \frac{\mu}{1 - \beta} \qquad (15) $$

where $\beta$ is the coefficient of the momentum term.

Adaptive  Momentum  Improves  Convergence 

Based on our work on asymptotic convergence with momentum, we have devised
an algorithm that forces the expected squared misadjustment $E[|v|^2]$ to fall off
nearly as $1/t$ at late times. The new algorithm does not involve estimating the
Hessian and its eigenspectrum, nor does it involve calculating an auxiliary statistic,
nor does it involve setting $\mu_0$. Instead the momentum parameter $\beta$ is adapted
on-line.

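For simplicity, consider momentum gradient descent in 1-D. Based on the
critical value for the learning rate (14) and the effective learning rate with
momentum (15), it is clear that if the momentum coefficient were set to $\beta = 1 - \mu_0 R$ ($R$
the Hessian) then we would achieve the optimal convergence rate $E[|v|^2] \propto 1/t$.
With this choice the effective rate is $\mu_{eff} = \mu_0 / (t (1 - \beta)) = 1/(R t)$, whose
initial value $1/R$ exceeds the critical value $1/(2R)$.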

Of course we do not know $R$, but it can be estimated on-line. For LMS,
$R = E[x x^T]$ and a convenient estimate is $\hat{R} = x_t x_t^T$. For bounded inputs,
one can show that an algorithm based on this choice achieves the optimal $1/t$
convergence rate.
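
A small simulation can make the adaptive-momentum idea concrete. The sketch below is a 1-D illustration, not the authors' implementation: it assumes the momentum term enters the update in the standard form $\beta_t (w_t - w_{t-1})$, uses the on-line choice $\beta_t = 1 - \mu_0 x_t^2$ (the 1-D reading of $\beta = 1 - \mu_0 R$ with the estimate $\hat{R} = x_t^2$), and draws bounded inputs uniformly so that $R = 1$. The initial rate $\mu_0 = 0.25$ is deliberately below the critical value 0.5, where plain annealed LMS converges slowly. All names and parameter values are hypothetical.

```python
import numpy as np

def run_lms(mu0, adaptive_momentum, n_nets=200, n_steps=20_000, sigma=1.0, seed=0):
    """Ensemble of 1-D LMS learners with mu_t = mu0 / t; returns the final E[v^2]."""
    rng = np.random.default_rng(seed)
    a = np.sqrt(3.0)                  # x ~ U(-a, a) is bounded and has E[x^2] = 1
    v = np.ones(n_nets)               # weight error v = w - w_*
    v_prev = v.copy()
    for t in range(1, n_steps + 1):
        x = rng.uniform(-a, a, n_nets)
        eps = rng.normal(0.0, sigma, n_nets)          # teacher noise
        mu = mu0 / t
        # Adaptive momentum coefficient from the on-line estimate of R (assumption)
        beta = 1.0 - mu0 * x ** 2 if adaptive_momentum else 0.0
        v_next = v - mu * x * (x * v - eps) + beta * (v - v_prev)
        v_prev, v = v, v_next
    return float(np.mean(v ** 2))

misadj_plain = run_lms(mu0=0.25, adaptive_momentum=False)
misadj_momentum = run_lms(mu0=0.25, adaptive_momentum=True)
print(misadj_plain, misadj_momentum)
```

In this setting the run with adaptive momentum should end with a much smaller misadjustment than the run without it, even though both use the same sub-critical $\mu_0$.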

Figure 2 shows the results of our adaptive momentum algorithm on a 2-D LMS
problem. The plots show, on a log-log scale, the expected squared misadjustment
(computed from an ensemble of 1000 networks) as a function of time. Optimal
convergence with $E[|v|^2] \propto 1/t$ corresponds to slope -1 on these curves. The
correlation eigenvalues for the input data are $\lambda_1 = 0.4$, $\lambda_2 = 4.0$, so $\mu_{crit} = 1.25$.
The plot on the left shows the convergence of standard LMS with learning rate
$\mu_0/t$ for the three initial rates $\mu_0 = (1.5, 1.0, 0.25)$. Only the curve corresponding
to $\mu_0 = 1.5$ exhibits the optimal convergence rate. The plot on the right shows
the results with adaptive momentum. Notice that regardless of $\mu_0$, at late times
all the curves exhibit the optimal convergence rate.


Figure 2: Expected squared misadjustment for an ensemble of 1000 LMS networks
(vertical axes: $\log[|v|^2]$). Correlation eigenvalues $\lambda_1 = 0.4$, $\lambda_2 = 4.0$.
LEFT - no momentum, RIGHT - adaptive momentum.

Similar results are obtained for problems with larger condition numbers.
Figure 3 shows simulation results without momentum, and with adaptive momentum,
for a condition number of $\approx 10^4$. In this simulation, the annealing and adaptive
momentum are started at late times. Without momentum, the convergence is
stalled: at late times the slope of the error curve is essentially zero. With
adaptive momentum, the asymptotic slope is $\approx -0.66$. Although this is not the
theoretical optimal value (slope -1), the improvement in convergence relative
to no momentum is substantial.


Figure 3: Simulation of 4-D LMS for condition number $\approx 10^4$.

Further work on these algorithms integrates automatic passage from
constant learning rate to annealing. We are currently extending the algorithm to
non-linear optimization problems and will start exercising it on large
speech-recognition networks.

3.2  Non-Linear  Dimension  Reduction 

This  work  is  aimed  at  developing  algorithms  for  data  modeling  and  compression 
that  perform  better  than  linear  techniques,  and  are  fast  to  train.  Initially  we 
have approached the problem of data dimension reduction. Kramer [17], DeMers
[18], and Oja [19] have proposed the use of 5-layer feed-forward auto-associative
networks with a bottle-neck middle layer to perform non-linear dimension
reduction. These networks have input and output layers of size equal to the dimension
of the data. There are three layers of hidden nodes. The number of nodes in
the second hidden layer is equal to the dimension of the encoded signal. The
networks are trained to perform an identity transformation on the input data.
After training, the low-dimensional encoding is extracted from the activities of
the nodes in the second hidden layer. These networks are able to provide more
accurate encodings than principal component analysis (PCA). However, they
are slow to train and are prone to trapping in poor solutions.

As  described  in  the  original  proposal,  we  have  developed  an  algorithm  that 
uses  local  PCA  to  reduce  data  dimension.  The  algorithm  partitions  the  space 
using  a  vector  quantizer  (VQ)  and  then  performs  a  PCA  projection  within  each 
of  the  Voronoi  cells  of  the  partition.  The  PCA  captures  the  local  structure  of 
the data, while the distribution of Voronoi cells captures the global, non-linear
structure of the data.
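
The two-stage procedure (vector quantizer partition, then a PCA projection inside each Voronoi cell) can be sketched in a few lines. This is an illustrative sketch rather than the code used for the reported experiments: the quantizer is a plain batch k-means loop standing in for whichever VQ algorithm was actually used, the test data are a synthetic noisy circle, and all function names are hypothetical. The error measure is the mean squared reconstruction error divided by the mean squared signal strength.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain batch k-means used here as a simple vector quantizer."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:              # keep the old center if a cell empties
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)

def pca_reconstruct(X, m):
    """Project X onto its m leading principal directions and map back."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:m].T
    return mean + (Xc @ V) @ V.T

def local_pca_reconstruct(X, k, m, seed=0):
    """Partition with the quantizer, then run PCA separately inside each cell."""
    _, labels = kmeans(X, k, seed=seed)
    X_hat = np.empty_like(X)
    for j in range(k):
        idx = labels == j
        if idx.any():
            X_hat[idx] = pca_reconstruct(X[idx], m)
    return X_hat

def normalized_error(X, X_hat):
    return float(((X - X_hat) ** 2).sum() / (X ** 2).sum())

rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2.0 * np.pi, 2000)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.01 * rng.normal(size=(2000, 2))

global_err = normalized_error(X, pca_reconstruct(X, m=1))
local_err = normalized_error(X, local_pca_reconstruct(X, k=8, m=1))
print(global_err, local_err)
```

On curved data such as this, a single global 1-D projection discards about half the signal, while the locally linear model follows the curve cell by cell.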
We originally exercised the algorithm on speech data and compared its
performance with global PCA, and with the global non-linear model implemented
by a 5-layer auto-associative network [20]. The original data consists of 32 DFT
coefficients (spanning the frequency range 0-4kHz) of the monophthongal vowels
extracted from continuous speech. The goal was to reduce the data to a low
(two- or three-dimensional) representation. The figure of merit for these
experiments was the error incurred in reconstructing the original signal from the
dimension-reduced representation.

Both the neural network and the local-linear technique provided lower
reconstruction error than PCA. The local linear technique provided roughly a 40%
decrease in error relative to the neural net, and trained up to an order of
magnitude faster [20].


Under the present grant, we exercised the algorithm on image data, again
comparing its performance with a neural network, and with PCA [21]. In those
experiments, we used the image database developed by DeMers and Cottrell,
comparing our results with their study of dimension reduction with neural
networks [22]. The original 64x64 images were first encoded into 50 principal
components. This 50-dimensional representation was then reduced to 5 dimensions
using either of the nonlinear techniques. A linear reduction to 5 dimensions was
provided by retaining only the leading 5 principal components. As with the
experiments on speech data, both non-linear techniques outperformed PCA. Our
locally linear algorithm outperformed the neural network [21]. For example, one
of the neural network models achieved a normalized reconstruction error² of 0.07
and required 31,980 cpu seconds to train³. A comparable locally linear model
(normalized error 0.0696) required only 50 cpu seconds to train.

Sample results of the image reconstruction are shown in figure 3. The images
clearly portray both the superiority of the non-linear techniques over PCA, and
the superiority of our locally linear technique over the neural network (mouth
shape is especially revealing in this series).


Figure 3: Two representative images. Left to right: original 50-PC image, reconstruction
from 5-D encodings: PCA, best neural network model, locally linear model with 10 partitions,
locally linear model with 50 partitions.

² The error measure is the mean squared reconstruction error divided by the mean squared signal strength.
³ Using a five-layer autoassociative network, DeMers and Cottrell [22] obtain a normalized reconstruction
error of 0.1317 for the same data. This is comparable to our results.

Coupling the Partition to the Projection

The algorithm discussed above, though very effective, is not optimal. The
algorithm first partitions the input space into disjoint Voronoi cells using a standard
VQ algorithm (e.g. LBG⁴ or competitive learning). After this partition is built,
the local PCA projection is performed within each cell of the partition. Thus
the initial partitioning is independent of the projection that follows.

One  would  expect  to  obtain  lower  distortion  in  the  final  representation  if  the 
partition  was  built  in  a  manner  that  was  determined  in  part  by  the  projection. 
Indeed  the  two  steps  can  be  coupled  by  building  the  partition  so  as  to  minimize 
the reconstruction error after the PCA projection. Either gradient descent or
LBG-like algorithms can be built using this distortion measure. We have coded
such algorithms and initial experiments indicate a 10-20% reduction in error
relative to the original algorithm in which the partition is constructed independently
of the projection. More detail is available in [5].
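
One way to picture an LBG-like algorithm built on this distortion measure is the alternating scheme sketched below: each point is assigned to the cell whose local PCA subspace reconstructs it best, and each cell's mean and principal directions are then refit to its members. This is a minimal sketch under stated assumptions, not the authors' code: the random initial partition, the synthetic noisy-circle data, and all names are hypothetical. Because each of the two steps can only lower the mean reconstruction error, the recorded error sequence is non-increasing.

```python
import numpy as np

def fit_cell(points, m):
    """Mean and m leading principal directions of the points in one cell."""
    c = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - c, full_matrices=False)
    return c, Vt[:m].T

def recon_error(X, c, V):
    """Squared distance from each row of X to the affine subspace (c, V)."""
    D = X - c
    D = D - (D @ V) @ V.T
    return (D ** 2).sum(axis=1)

def coupled_local_pca(X, k, m, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.permutation(len(X)) % k          # random partition, no empty cells
    cells = [None] * k
    history = []
    for _ in range(n_iter):
        for j in range(k):                        # refit step
            pts = X[labels == j]
            if len(pts) > 0:                      # keep old parameters if a cell empties
                cells[j] = fit_cell(pts, m)
        errs = np.stack([recon_error(X, c, V) for c, V in cells], axis=1)
        labels = errs.argmin(axis=1)              # assignment by reconstruction error
        history.append(float(errs.min(axis=1).mean()))
    return labels, cells, history

rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2.0 * np.pi, 1000)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.01 * rng.normal(size=(1000, 2))

labels, cells, history = coupled_local_pca(X, k=6, m=1)
print(history[0], history[-1])
```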

3.3  Network  Pruning 

Fitting model complexity to data is one of the outstanding problems in
statistical model building. Rich models, those with many parameters, allow close fits to
sample data, but may perform badly on out-of-sample data (so-called
generalization performance). To counter this over-fitting, various techniques have been
introduced that decrease model variance at the expense of model bias in order
to improve performance on out-of-sample data.

• Regularization schemes (e.g. weight decay) add a penalty term to the cost
function. The proper coefficient for this term is not known a priori, so one
must perform several optimizations with different values; a cumbersome
process.

• Weight-elimination schemes (e.g. optimal brain damage [24] and its
derivative optimal brain surgeon [25]) involve training large nets and then removing
the weights that least affect the training error. These techniques require
calculating the Hessian or some approximation to it. Calculating the full
Hessian is impractical for large nets, and the approximations are often poor.

• Early stopping monitors the error on a validation set and halts learning
when this error starts to increase.

⁴ The Linde-Buzo-Gray algorithm [23, and references therein] is the commonly-used batch-mode algorithm
for designing a vector quantizer.

We have developed an alternative technique that uses principal component
analysis (PCA) in conjunction with supervised learning. Briefly stated, the
technique uses PCA to decorrelate node activities and then eliminates the
(decorrelated) degrees of freedom that have the least effect on the output error (least
bias). The technique is fast to implement and achieves good results on both
linear and non-linear models. Our paper [7], delivered as an oral presentation at
the 1993 NIPS meeting, gives more details of the algorithm implementation and
results on several sample problems.
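
For a single linear layer the idea can be sketched as follows. This is an illustration under stated assumptions and not necessarily the procedure of [7]: node activities `U` feeding the layer are decorrelated by diagonalizing their (uncentered) correlation matrix, each principal direction $v_i$ with eigenvalue $\lambda_i$ is scored by $\lambda_i |W v_i|^2$, which is the mean squared change in the layer output on the given data if that direction is removed, and the lowest-scoring directions are projected out of the weights. The function name, the score name `saliency`, and the random test data are hypothetical.

```python
import numpy as np

def pc_prune(W, U, n_remove):
    """Remove the n_remove principal directions of the activities U that matter least.

    W : (n_out, n_in) weights of a linear layer
    U : (N, n_in) activities feeding the layer
    Returns the projected weights and the scores of the removed directions.
    """
    Sigma = U.T @ U / len(U)                     # correlation matrix of the activities
    lam, V = np.linalg.eigh(Sigma)               # columns of V are principal directions
    saliency = lam * ((W @ V) ** 2).sum(axis=0)  # output-error cost of dropping each one
    drop = np.argsort(saliency)[:n_remove]
    keep = np.setdiff1d(np.arange(len(lam)), drop)
    P = V[:, keep] @ V[:, keep].T                # projector onto the retained directions
    return W @ P, saliency[drop]

rng = np.random.default_rng(0)
U = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))   # correlated activities
W = rng.normal(size=(2, 6))

W_pruned, removed = pc_prune(W, U, n_remove=2)
output_change = float(np.mean(np.sum((U @ W.T - U @ W_pruned.T) ** 2, axis=1)))
print(output_change, removed.sum())
```

No Hessian is formed at any point; the only matrix decomposed has the size of the layer's input.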


3.4  Learning  Invariances 

In machine learning one sometimes wants to incorporate invariances into the
function learned. Our knowledge of the problem dictates that the machine
outputs ought to remain constant when its inputs are transformed under a set of
operations $G$⁵. In character recognition, for example, we want the outputs to be
invariant under shifts and small rotations of the input image.

There are several ways to achieve this invariance:


1. The invariance can be built into the input representation. In image
processing the use of Fourier amplitude coefficients, rather than pixel intensities,
provides invariance under translations.

2. In neural networks, the invariance can be hard-wired by weight sharing in
the case of summation nodes [26] or by constraints similar to weight sharing
in higher-order nodes [27].

3. One can enhance the training ensemble by adding examples of inputs
transformed under the desired invariance group, while maintaining the same
targets as for the raw data.

4. One can add to the cost function a regularizer that penalizes changes in the
output when the input is transformed by elements of the group [10, 11].

Intuitively one expects the approaches in 3 and 4 to be intimately linked.

This link is established by writing the probability distribution for the
enhanced training set in terms of the original distribution and the distortions
introduced. These transformations, or distortions, of the inputs are carried out
by group elements $g \in G$. For Lie groups⁶, the transformations are analytic
functions of parameters $\alpha \in \mathbb{R}^k$,

$$ x \rightarrow x' = g(x; \alpha) , \qquad (16) $$

with the identity transformation corresponding to parameter value zero

$$ g(x; 0) = x . \qquad (17) $$

By adding distorted input examples we alter the original density $p(x)$. We
characterize the frequency with which different distortions are represented in the
enhanced ensemble by a probability density over group parameters $p(\alpha)$. With
this density, the distribution for the distortion-enhanced input ensemble becomes

$$ p(x') = \int\!\!\int d\alpha \, dx \; p(x'|x,\alpha) \, p(\alpha) \, p(x) = \int\!\!\int d\alpha \, dx \; \delta(x' - g(x;\alpha)) \, p(\alpha) \, p(x) , $$

where $\delta$ is the Dirac delta function.

⁵ We assume that the set forms a group.
⁶ See for example [28].

Finally we impose that the targets remain unchanged when the inputs are
transformed according to (16), i.e., $p(t|x') = p(t|x)$.

In this framework, the cost function for the distortion-enhanced training data
is shown to be equivalent to the cost function for the original (untransformed)
data, plus a regularizer term.

For unbiased models, the regularizer is shown to reduce to a simple penalty
for violation of the desired invariance:

$$ \mathcal{E}_R = \int d\alpha \, p(\alpha) \int dx \, p(x) \, \left[ f(x,w) - f(g(x;\alpha),w) \right]^2 \qquad (18) $$
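
A penalty of this kind, the mean squared difference between the model output on an input and on its transformed version, can be estimated by Monte Carlo sampling. The sketch below is an illustration only: it takes the group to be rotations of the plane (so that $g(x; 0) = x$), draws inputs from a standard Gaussian and rotation angles from a narrow Gaussian, and compares a rotation-invariant function with one that is not invariant. The function names and the choice of densities are hypothetical.

```python
import numpy as np

def rotate(x, alpha):
    """Rotate each row of x (shape (N, 2)) by the matching angle in alpha (shape (N,))."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.stack([c * x[:, 0] - s * x[:, 1],
                     s * x[:, 0] + c * x[:, 1]], axis=1)

def invariance_penalty(f, x, alpha):
    """Monte Carlo estimate of the mean of [f(x) - f(g(x; alpha))]^2."""
    return float(np.mean((f(x) - f(rotate(x, alpha))) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(10_000, 2))          # samples from p(x)
alpha = rng.normal(0.0, 0.1, 10_000)      # samples from p(alpha), concentrated near zero

def f_radial(z):                          # invariant under rotations
    return (z ** 2).sum(axis=1)

def f_first_coordinate(z):                # not invariant under rotations
    return z[:, 0]

penalty_invariant = invariance_penalty(f_radial, x, alpha)
penalty_non_invariant = invariance_penalty(f_first_coordinate, x, alpha)
print(penalty_invariant, penalty_non_invariant)
```

The penalty vanishes (up to rounding) for the invariant function and is strictly positive for the other, so adding it to a cost function pushes the learned function toward the desired invariance.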

Our publications [8] contain further detail.